-
Notifications
You must be signed in to change notification settings - Fork 5
Pushing update for MetaCAT #155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Includes changes to data_utils
mart-r
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a question of whether we should change the API for the encode_category_values method. Especially since it's used by other parts (e.g trainer) as well. At the very least, I'd like to have some test that makes sure it has consistent behaviour. But ideally, I'd like to keep API stable if we can.
The other comment I've left is with the indentation within the method. After all, hard to read and/or maintain code is one of the reasons we had (at least one of) the issues before. So in order to avoid the same issue happening again, let's split this long method into multiple parts - each with its own separate scope and responsibility.
| return data_sampled | ||
|
|
||
|
|
||
| def encode_category_values(data: Dict, existing_category_value2id: Optional[Dict] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is used by the trainer:
| data, _, _ = encode_category_values(data, existing_category_value2id=category_value2id) |
Now, it looks like this change doesn't change the API in a way that would break that (at least not immediately). However, I'd like to have some stability in our API.
Perhaps a test for this method to make sure the behaviour is consistent?
Creating helper functions for checking alternative class names and undersampling data
Changes for flake8
mart-r
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add type hints as well.
You've already described the types in the doc strings, so adding them to the signature shouldn't be that much extra work.
mart-r
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
* Pushing update for metacat Includes changes to data_utils * Update data_utils.py * Update data_utils.py * Update data_utils.py Creating helper functions for checking alternative class names and undersampling data * Update data_utils.py * Update data_utils.py Changes for flake8 * Update data_utils.py * Update data_utils.py
This includes:
category_namedoes not match the data (and the alternatives), ensuring the correctcategory_nameis shown in the exception instead ofNonecategory_value2idto ensure it covers all possible variations, including partially filled mapping (with incorrect class names)nclassesand the number of classes found in data, in case of mismatch, exception is raised as model loading needs to be performed again